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ABSTRACT 

Gene-editing nucleases enable targeted modification 
of DNA sequences in living cells, thereby facilitating 
efficient knockout and precise editing of endogenous 
loci. Engineered nucleases also have the potential to 
introduce mutations at off-target sites of action. 
Such unintended alterations can confound interpret- 
ation of experiments and can have implications for 
development of therapeutic applications. Recently, 
two improved methods for identifying the off-target 
effects of zinc finger nucleases (ZFNs) were 
described-one using an in vitro cleavage site selec- 
tion method and the other exploiting the insertion of 
integration-defective lentiviruses into nuclease- 
induced double-stranded DNA breaks. However, ap- 
plication of these two methods to a ZFN pair targeted 
to the human CCR5 gene led to identification of 
largely non-overlapping off-target sites, raising the 
possibility that additional off-target sites might 
exist. Here, we show that in silico abstraction of 
ZFN cleavage profiles obtained from in vitro 
cleavage site selections can greatly enhance the 
ability to identify potential off-target sites in human 
cells. Our improved method should enable more 
comprehensive profiling of ZFN specificities. 

INTRODUCTION 

Gene-editing nucleases, such as zinc finger nucleases 
(ZFNs), transcription activator-like effector nucleases 



(TALENs) and clustered regularly interspaced short pal- 
indromic repeats (CRISPR)/CRISPR-associated (Cas) 
nucleases, can be used to create targeted sequence alter- 
ations with high efficiencies in numerous cell types and 
organisms (1-7). Repair of nuclease-induced double- 
stranded breaks can be exploited to introduce either 
insertion/deletion (indel) mutations via non-homologous 
end-joining (NHEJ) or specific sequence alterations from 
a donor template via homology-directed repair (1,2). 
Comprehensive delineation of unintended off-target 
mutations is important for customized nucleases in 
many biological applications and will be essential for de- 
veloping therapeutic strategies based on these proteins. 

Two different methods have recently been described for 
characterizing the genome-wide specificities of ZFNs (8,9), 
but neither study comprehensively identified the full- 
spectrum of possible off-target mutations. One method, 
previously developed by Liu and colleagues, used an 
in vitro cleavage site selection assay to identify sequences 
from a large partially degenerate library (based on the 
intended target DNA site) that can be cleaved by ZFNs. 
In vitro selections with a CCR5-targeted ZFN pair 
identified 36 potential off-target cleavage sites that occur 
in the sequence of the human genome; analysis of these 
sites in human cells in which CCR5-targeted ZFNs had 
been expressed revealed nine bona fide off-target sites (8). 
Another approach, described by von Kalle and colleagues, 
exploited the incorporation of integrase-deficient Anti- 
virus (IDLV) DNAs into nuclease-induced double- 
stranded breaks to map ZFN cleavage sites in human 
cells (9). Application of this approach to the same 
CCR5-targeted ZFNs characterized with the in vitro selec- 
tion approach identified four off-target genomic sites. 
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However, the substantial lack of overlap between off- 
target sites identified in these two studies (only one site 
was common to both sets) strongly suggested that neither 
identified all possible off-target sites. In addition, these 
results also suggested that a broader range of potential 
off-target sites might exist beyond the sets identified by 
these two methods. 

Here, we show that in silico abstraction of ZFN 
cleavage profiles generated by the selection method of 
Liu and colleagues provides an improved approach to 
screen the human genome for potential ZFN off-target 
sites. This enhanced strategy identifies both previously 
described as well as dozens of additional off-target sites 
for a ZFN pair targeted to CCR5 gene. We also show that 
this improved method works effectively for another ZFN 
pair targeted to the VEGFA gene. Our results demonstrate 
that the potential landscape of off-target mutagenesis 
effects for ZFNs may be broader than delineated in 
previous studies. 

MATERIALS AND METHODS 

Plasmids 

The plasmids encoding ZFNs targeted to sites in the 
human CCR5 (10) and VEGFA (11) genes were modified 
to include heterodimeric EL/KK Fokl mutations (12) and 
were constructed as described in Pattanayak et al. (8) 
(Supplementary Figure SI). 

Processing of in vitro selection data 

Sequence reads from the in vitro cleavage assay reported 
by Pattanayak were used to generate nucleotide windows 
comprising the core 9 bp (VEGFA) and 12 bp (CCR5) zinc 
finger recognition sites as well as the adjacent nucleotides 
for each ZFN half-site. Sequences shown to cleave effi- 
ciently in vitro were considered active. The preselection 
library sequences minus those seen in the active set were 
considered not efficiently cleaved and labeled as the 
inactive class. Duplicate entries were removed unless 
they were identified as independent cleavage events 
either by experiment or sequence variation in the spacer. 
Several classifiers including SVMs, decision trees and 
Naive Bayes were tested in 10-fold cross-validation 
analyses using WEKA v3.5.7 (13). Naive Bayes performed 
as well or better than the rest of the classifiers and was 
used exclusively going forward in this study. The test set 
was built from human genome build HG36 was parsed 
into similar windows using spacers of 5 and 6 nt separating 
the two zinc finger half-sites. 

Validation of putative cleavage sites 

Individual windows are scored 0 to 1 with windows of 
lower scores representing sequences that are more likely 
to be cleaved by the CCR5 ZFNs. K562 cells were treated 
with catalytically active CCR5-224 ZFNs or a vector-only 
control, genomic DNA was harvested, and deep 
sequencing was used to analyze loci of interest as 
described in Pattanayak et al. (8), with the exception 
that sequencing was carried out for each paired-end 



library with a 150-cycle MiSeq run (Illumina; Harvard 
Biopolymers Facility, Boston, MA for CCR5 samples 
and Dana Farber Cancer Institute for VEGFA samples). 
Oligonucleotides used to amplify genomic loci of interest 
are listed in Supplementary Table S10. 

Data processing to identify putative mutagenic 
NHEJ events 

Individual reads were mapped using primer sequences to 
the individual amplicons and aligned using the 
Needleman-Wunsch algorithm with affine gap penalties 
(14). Alignments with <40bp (minimum combined 
length of primer) to the reference were excluded, and 
targets with < 1 500 reads in either the treated or untreated 
samples were excluded. Individual alignments were 
combined to generate a multiple sequence alignment. 
Identical alignments were counted, condensed and 
verified to map HG37.57 using BLAT (http://www. 
kentinformatics.com/). Sequences that mapped preferen- 
tially to an alternate target were excluded. Potential NHEJ 
events required indels of at least 2nt in length that 
originated from within the spacer between the ZFN half- 
sites. 



RESULTS 

We sought to improve the original strategy of Liu and 
colleagues by addressing its inability to interrogate 
cleavage site libraries in vitro to a depth sufficient to 
identify all possible off-target sites present in the human 
genome. To do this, we added a machine-learning-based 
step that uses cleavage site preferences from the in vitro 
selection experiments to predict what sequences in the 
human genome are most likely to be cleaved (Figure 1). 
We used standard machine-learning techniques to con- 
struct Naive Bayes classifiers that quantify how the nu- 
cleotide identity at each position within a DNA site differs 
between members of a partially degenerate library that 
were cleaved efficiently in vitro and those that were not 
('Materials and Methods' section). The scores generated 
by each classifier range from 0 to 1, with lower scores 
representing a higher probability that any given site will 
be cleaved ('Materials and Methods' section). 

We performed an initial test of our approach by de- 
veloping a classifier based on in vitro site selection data 
previously obtained for ZFNs targeted to a site in the 
human CCR5 gene. As shown in Supplementary 
Table SI, application of this CCR5 ZFN classifier to 
the human genome resulted in the overwhelming 
majority of potential target sites having a high classifier 
score: 11421 321 184 of 11421 337 066 potential sites 
(99.999861%) received a score higher than 0.75. By 
contrast, only 15 882 sites (0.000139% of all potential 
sites) had a score lower than 0.75, and only 1123 sites 
(0.00000983% of all potential sites) had a score below 
0.5. Importantly, all 12 bona fide off-target sites identified 
previously by the in vitro cleavage site selection, and the 
IDLV integration methods had scores below 0.75. In 
addition, 11 of these 12 sites fall within the top 25% 
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Figure 1. Schematic illustrating the original method by Pattanayak et al. (8) (blue arrows) and the enhanced approach that incorporates addition of 
a classifier-based step (green arrows). 



of sites with scores below 0.75 (Supplementary Table S2) 
(8-10). 

Having established classifier score cutoffs that enable 
identification of all previously known off-target sites for 
the CCR5-targeted ZFNs, we next prospectively tested 
whether other sites with scores below 0.75 might 
include additional bona fide off-target sites. However, a 
comprehensive analysis of all sites with scores below 0.75 
would require deep sequencing of 15 882 different alleles, 
an experiment that would be challenging and expensive 
to perform, given the current cost of next-generation 
sequencing. Therefore, we instead systematically 
assessed a smaller sampling of sites by first grouping 
them based on their position in exonic or non-exonic 
genomic sequence and then binning sites within each of 
these groups according to their classifier scores (i.e. — 0.0 
to 0.1, 0.1 to 0.2, etc.). To achieve high levels of nuclease 
activity that would facilitate detection of lower frequency 
off-target events, we used conditions described by Liu 
and colleagues to overexpress CCR5-targeted ZFNs in 
K562 cells ('Materials and Methods' section). We then 
used deep sequencing to assess the top 13 scoring sites 
(if available) within each bin for evidence of NHEJ- 
mediated indel mutations in the genomic DNA of these 
cells. 

Analysis of 138 sites identified NHEJ -mediated indel 
mutations not only at the intended CCR5 target site and 
at a previously known off-target site in the CCR2 gene but 
also at 21 new off-target sites (Table 1). As expected, the 
percentage of bona fide off-target sites found within each 
classifier score bin was inversely correlated with the mag- 
nitude of the score (i.e. — a greater percentage of actual 
off-target sites were identified in the lower score bins). 



For example, 35% (16 of 46) of the screened targets 
with scores in the first tercile (lowest scores) showed sig- 
nificant evidence of NHEJ-mediated indel mutations 
compared with 13% (6 of 46) and 2% (1 of 46) of sites 
with scores in the second and third terciles, respectively 
(Supplementary Table S3). 

To test the generalizability of our classifier-based 
approach, we used it to predict off-target sites for 
another pair of ZFNs targeted to the human VEGFA 
locus (Supplementary Table S4). Previous work using 
the in vitro cleavage site selection assay had identified a 
large number of potential off-target sites for this ZFN pair 
in human cells (Supplementary Table S5) (8). We used this 
selection data to build a classifier that we used to score 
every possible site in the human genome ('Materials and 
Methods' section). As we observed with the CCR5 classi- 
fier, only a small number (7242) of genomic sites had a 
classifier score below 0.75, and only 936 sites had a score 
below 0.5. In addition, all 31 bona fide off-target sites 
identified previously with the in vitro selection data all 
had scores below 0.6, with all but one of these sites 
having scores below 0.5 (Supplementary Table S6). We 
assessed 159 potential off-target sites (identified using 
the same stratified sampling approach we used for the 
CCR5 ZFNs) for evidence of off-target mutations from 
genomic DNA of human cells in which the VEGFA- 
targeted nucleases had been expressed. This systematic 
stratified analysis identified 34 bona fide off-target sites, 
including eight that were previously identified by 
Pattanayak et al. (8) and 26 that were novel (Table 2). 
We note that that the majority of these novel off-target 
sites had low classifier scores, again demonstrating the pre- 
dictive capability of our method (Table 2). Furthermore, 
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Table 1. Off-target sites for ZFNs targeted to CCR5 displaying significant evidence of ZFN induced indels grouped by 
classifier probability score 
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Table 2. Off-target sites for ZFNs targeted to VEGFA displaying significant evidence of ZFN induce NHEJ grouped by 
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several of the sites we predicted to be off-target sites that 
did not show a statistically significant level of NHEJ mu- 
tations in this study had been previously confirmed as 
off-targets when screened with a greater depth of 
sequencing reads by Pattanayak et al. (8), suggesting 
that a greater number of the predicted off-target sites 
might show evidence of mutation with deeper sequence 
sampling. 

DISCUSSION 

Our results show in silico abstraction of in vitro cleavage 
data provides a strategy that more broadly identifies all 
potential off-target sites of ZFN activity in human cells 
than previously described methods. Our classifier-based 
approach not only successfully re-identified all previously 
known off-target sites for two different ZFNs but also 
enabled the identification of many additional novel off- 
target sites, including some that differ from the target 
sequence by as many as 8 (of 24) or 6 (of 18) bp for the 
CCR5- or VEGFA-targeted ZFNs, respectively (Tables 3 
and 4) (8-10). Because sequences harboring these numbers 
of mismatches will occur frequently in the human genome, 



identifying these off-targets would have been previously 
intractable by simple mismatch counting approaches; 
indeed, using such strategies would require screening 
hundreds of thousands of potential sites (Supplementary 
Tables S7 and S8). 

Importantly, because we only assessed a small sampling 
of the top scoring potential off-target sites in cells, we 
believe that the full range of potential off-target sites for 
the two ZFN pairs we examined is likely more expansive 
than just those identified in this study. This expectation is 
supported by another experimental screen (data not 
shown) that identified six additional bona fide off- 
target sites for the CCR5-targeted ZFNs (Supplementary 
Table S9 and Supplementary Discussion). Collectively, 
these results clearly demonstrate that ZFN off-target 
sites may occur at low rates much more widely on a 
genome-wide scale than suggested by data from previously 
described reports. 

Although our data clearly demonstrate that sites with 
low classifier scores are highly enriched for bona fide off- 
target sites, our results also show that bona fide off- 
targets are present (albeit at a much lower frequency) 
among loci with higher classifier scores. This suggests 
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Table 3. In vivo validated off-targets for ZFNs targeted to CCR5 
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that comprehensive identification of off-target sites will 
require interrogation of a large number of loci by deep 
sequencing. We expect that decreases in the price per 
base and increases in the number of bases that can be 
sequenced should increase the number of potential sites 
with low classifier scores that can be examined, thereby 
enabling the identification of a greater number of bona 
fide off-target sites. However, until such reductions 
in sequencing costs become reality, an alternative 
approach might be to look at off-targets with the best 
scores or to pre-screen off-targets bioinformatically for 
sites that fall in regions of high priority such as 
promoters, exons and non-coding RNAs. 

We note that the number of off-target sites identified 
by our approach may be larger or smaller depending on 
the cell type examined as well as the level and duration 
of ZFN expression. Not all of the sites with low classifier 
scores we examined showed evidence of mutagenesis. 



Potential reasons for this might include DNA methyla- 
tion of the target site or chromatin status of the gene. 
These parameters will be cell-type specific and would not 
be accounted for by in vitro selections or in silico classi- 
fiers. As large-scale efforts such as ENCODE and the 
NIH Roadmap Epigenomics Mapping Project define 
these variables in multiple different cell types, it may 
be possible to use such information to prioritize sites 
with low classifier scores and thereby to increase the 
yield of bona fide off-target sites identified by deep 
sequencing. In addition, we expressed ZFNs from a 
strong constitutive CMV promoter using transiently 
transfected plasmids and harvested genomic DNA from 
cells 5 days post-transfection. Lower levels and shorter 
durations of ZFN expression might be expected to 
induce fewer off-target mutations, whereas higher levels 
and longer durations might induce an even greater 
number of such mutations. 
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Table 4. In vivo validated off-targets for ZFNs targeted to VEGFA 
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More broadly, the combined strategy of using 
in vitro cleavage site selection data together with 
machine-learning-based classifiers might also be extended 
to specificity information from other sources (e.g.-SELEX 
or bacterial selection) and to define the specificities of 



nucleases built on other platforms (e.g. — TALENs or 
CRISPR-Cas RNA-guided nucleases). The use of 
machine learning to improve the predictive power of 
data derived from in vitro selection experiments could be 
particularly useful for ZFNs composed of greater numbers 
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of fingers in each monomer and for TALENs. These nu- 
cleases target longer sites, making it challenging to ad- 
equately sample all potential off-targets even in an 
in vitro system. Continuing to better define off-target 
effects of targeted nucleases will provide important 
information to guide refinement of the genome-wide 
specificities of these reagents. These improvements will 
be critically important, as these targeted nucleases are 
more widely applied for both research and therapeutic 
approaches. 
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